Add initial Telemetry API with Tracing support for sampling and custom tags#1740
Conversation
|
😊 Welcome @douglas-reid! This is either your first contribution to the Istio api repo, or it's been You can learn more about the Istio working groups, code of conduct, and contributing guidelines Thanks for contributing! Courtesy of your friendly welcome wagon. |
|
Skipping CI for Draft Pull Request. |
|
/test all |
mesh/v1alpha1/config.proto
Outdated
| // $hide_from_docs | ||
| // Next available field number: 58 | ||
| message TelemetryProvider { | ||
| // REQUIRED. A unique name identifying the telemetry provider. |
There was a problem hiding this comment.
The name must be unique. The name is used to refer to a provider from the telemetry api, so it should also be meaningful.
|
Ping! Folks we need to keep this moving. |
I'm working on addressing initial comments. Will push an update today. |
telemetry/v1alpha1/telemetry.proto
Outdated
| // | ||
| // For resources with a workload-selector, it is only valid to have one resource selecting | ||
| // any given workload. If multiple resources select a single resource, the oldest known | ||
| // resource will be used to the exclusion of all other resources. |
There was a problem hiding this comment.
Shouldn't be 'the most specific' ?
There was a problem hiding this comment.
This is addressing the issue of multiple resources at the same level - there would be no 'most specific'. I'll update the text.
|
@douglas-reid are you going to scope this PR down or hide APIs so only the Tracing APIs are exposed? I know we want to target tracing for 1.9 so I'm asking this question. |
|
We should hide the whole thing, and unhide parts as the impl is checked in. |
Right now, other than the providers bits, this is entirely Tracing oriented (with commented out placeholders for metrics/logs). Should I remove the non-tracing related provider config too? |
I'm fine with Mandar's proposal too as long as we only unhide APIs that are implemented. Keeping a lot of unimplemented APIs spanning multiple releases can be a bit of nuisance. |
|
What is missing ? We can add a 'name' to the old services, or we can use
default names for the old-style
services ( I think we only supported one instance ). And merge it with
services defined using the new
extension provider.
I'm not even sure we must do that - for the first release it is perfectly
fine to keep 'one provider' or 'one provider
of each type' and focus on the rest of the API.
I don't think 'user should not have been manually editing' or 'change the
install behavior' is valid - we know
a lot of users customized the install, we want to support skip version
upgrades - and most of the values.yaml
is even less stable or supported than MeshConfig.
…On Thu, Nov 19, 2020 at 11:06 AM Douglas Reid ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In mesh/v1alpha1/config.proto
<#1740 (comment)>:
> + // Optional. Configures max number of edges per batch in requests.
+ // $hide_from_docs
+ google.protobuf.Int32Value max_edges_batch_size = 12;
+ }
+
+ // Configuration for Envoy's gRPC [Access Log Service](https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/accesslog/v2/als.proto)
+ // $hide_from_docs
+ message RemoteLoggingConfig {}
+
+ // Configuration for Envoy's gRPC [Metric Service](https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/metrics/v2/metrics_service.proto)
+ // $hide_from_docs
+ message RemoteMetricsConfig {}
+ }
+
+ // $hide_from_docs
+ repeated ExtensionProvider telemetry_providers = 58;
We can add to extension_providers as both efforts move out of development
mode. The concern is causing issues for development of that feature, as
these are evolving in parallel. Both are hidden from docs and not
advertised initially.
As for moving fields: we need a way to refer to telemetry
backends/services in the telemetry resources. Some of these config fields
may have been around for a bit, but they were never considered stable.
Users should not have been manually editing the mesh config fields.
Changing the install behavior and automating conversion is something that
can be done.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1740 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAUR2QKDU22L3FGTWGNWOTSQVUCDANCNFSM4TQCBINA>
.
|
|
On Thu, Nov 19, 2020 at 12:23 PM Douglas Reid ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In telemetry/v1alpha1/telemetry.proto
<#1740 (comment)>:
> + // Optional. Workload selector decides where to apply the Telemetry policy.
+ // If not set, the Telemetry policy will be applied to all workloads in the
+ // same namespace as the Telemetry policy.
+ istio.type.v1beta1.WorkloadSelector workload_selector = 1;
+
+ // Optional. Tracing defines the per-workload overrides for trace span
+ // reporting.
+ repeated TracingRule tracing = 2;
+
+ // AccessLoggingRule access_logging = 3;
+ // MetricsRule metrics = 4;
+}
+
+// TracingRule defines how trace spans should be reported (sampling rate, custom tags)
+// and under what conditions the reporting should be conducted.
+message TracingRule {
We don't, but these are the names of the proto messages. They aren't
exposed to end users editing config. They do add specificity and clarity
for development. Here we have both TracingRule and TracingConfig
messages. We can't call them both Tracing. If we drop one, I think it
would be preferable to drop Config from the TracingConfig name. Thoughts?
Maybe it's my english - but to me all are configs, and all configs provide
some 'rule' or policy.
I'm not quite sure what is the proper use of 'rule' vs 'policy' - I heard
about "policy rules" in
other contexts. But I can't imagine anything in a config file not a
'config', and all
configs are creating a rule (or policy) for the thing it is configuring.
And we also have the top level Policy configs, which set security rules.
But I'll defer to
native english speakers. To me it's confusing and redundant.
|
|
On Thu, Nov 19, 2020 at 12:27 PM Douglas Reid ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In telemetry/v1alpha1/telemetry.proto
<#1740 (comment)>:
> + // Note: Use OUTBOUND for gateways (even including ingress)
+ OUTBOUND = 1;
+ // Match incoming traffic for the proxy. Use this to select "server-side"
+ // traffic in telemetry reporting.
+ INBOUND = 2;
+ }
+
+ // Protocol selects for traffic based on the identified protocol of that traffic.
+ enum Protocol {
+ // (Default) Matches all traffic, regardless of protocol
+ ALL_PROTOCOLS = 0;
+ // Selects for HTTP traffic, including HTTP/1.1, gRPC, HTTP/2.
+ HTTP = 1;
+ // Selects for all non-HTTP traffic.
+ // NOTE: Tracing is currently only supported for HTTP.
+ TCP = 2;
We do need TCP - at least with the way things work today. Telemetry is
handled differently for TCP in Istio (different metrics, no tracing,
different access logs). At the namespace or root config level, we will need
to be able to specify policy at the protocol level -- without knowing which
ports will be involved.
TCP-over-HTTP will still be need to be handled as TCP traffic for
telemetry purposes, I imagine. We would want to report on connections,
etc., right?
I don't know... It may be possible in some cases, but it's a fine line.
Few protocols now have first-class 'over H2' or 'over WS' ( for example
SSH-over-websocket).
And if you mix the ALPN and other handshakes - plus upgrading/downgrading
of protocols - it becomes
more and more confusing.
I understand current implementation is what it is, but maybe we don't need
to express it in the API
and tie our hands.
IMO everything is 'streams with metadata'.
|
|
On Thu, Nov 19, 2020 at 12:37 PM Douglas Reid ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In telemetry/v1alpha1/telemetry.proto
<#1740 (comment)>:
> +}
+
+// Used to bind Telemetry configuration to specific providers for
+// targeted customization.
+message ProviderRef {
+ // Required. Name of Telemetry provider in MeshConfig.
+ string name = 1;
+}
+
+// TracingConfig defines the workload-level overrides for tracing behavior within
+// a mesh. It can be used to enable/disable tracing, as well as to set sampling
+// rates and custom tag extraction.
+message TracingConfig {
+ // Required. Name of providers to which this configuration should apply. At
+ // least one provider needs to be specified.
+ repeated ProviderRef providers = 1;
@nrjpoddar <https://github.com/nrjpoddar> we initially wanted to have
this be repeated, but for any given envoy hcm there can only be a single
provider specified. Should we restrict this to one here?
If you can do it via doc. Envoy may change, or someone may write a WASM
provider that supports this, etc.
I think it's reasonable to document that the initial implementation only
supports 1, or 1 of each kind.
… —
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1740 (review)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAUR2VLYLY7AKXENA3MWDLSQV6ZVANCNFSM4TQCBINA>
.
|
smawson
left a comment
There was a problem hiding this comment.
Came here to hit the big green approve button but I see Louis has some lingering comments so I'll let him hit the button :)
There was a problem hiding this comment.
@dougreid this is looking good!
One overall comment:
All the providers seem to have few common fields related to address/port which are copied everywhere. Should we think about creating a type for it?
May be this was explored in conversations above so just me to it instead of rehashing.
Proto doesn't have a way to have "embedded" types like golang, so this means that we would need it nested which I don't think adds too much value. While it looks weird in the proto, a user doesn't see that - they just see that all providers have a "address" field which makes intuitive sense. So personally I think the way we have it now makes sense. |
I understand Proto limitations. Since these fields collectively describe the transport/cluster settings clubbing them together under a type does not feel odd to me. I just don’t think it’s good API practice to keep fields across growing list of messages consistent by copying/pasting. |
There was a problem hiding this comment.
@douglas-reid is max length supported by all currently listed providers?
telemetry/v1alpha1/telemetry.proto
Outdated
| // same namespace as the Telemetry policy. | ||
| istio.type.v1beta1.WorkloadSelector selector = 1; | ||
|
|
||
| // Optional. Tracing defines the per-workload overrides for trace span |
There was a problem hiding this comment.
This cannot be optional I’m assuming?
There was a problem hiding this comment.
Using per-workload here might be misguided as you can use it to refer to all workloads in namespace or mesh if selectors are omitted.
There was a problem hiding this comment.
We can add back in the "Optional" bit later, if that helps. This is anticipating adding in Metrics and AccessLogging (in which case you will not be required to add Tracing configuration).
There was a problem hiding this comment.
Got it but I would prefer to make comments mimic the current implementation and not the future.
| // will be used. | ||
| // NOTE: At the moment, only a single provider can be specified in a given | ||
| // Tracing rule. | ||
| repeated ProviderRef providers = 2; |
There was a problem hiding this comment.
What’s the behavior if this resource is created in istio-system or root config namespace with no workload selectors and configures a provider? Does that override the MeshConfig setting?
There was a problem hiding this comment.
Is this only a prob during transition phase when both exist? I assume the config in mesh config will be eventually removed?
There was a problem hiding this comment.
@nrjpoddar it overrides the meshconfig setting for all matches that "fallback" to root config namespace (a result of the full replacement semantics). since the most specific match "wins", we have the following scenarios:
- mesh config default + root config ns resource with provider = provider from root config ns resource is used in the mesh
- mesh config default + root config ns resource with provider + workload-specific resource without provider = provider from the mesh config default is used.
There was a problem hiding this comment.
One option is to not allow provider reference in the root config namespace.
There was a problem hiding this comment.
The current UX
- mesh config default + root config ns resource with provider = provider from root config ns resource is used in the mesh
- mesh config default + root config ns resource with provider + workload-specific resource without provider = provider from the mesh config default is used.
is quite confusing I think. @louiscryan @smawson?
There was a problem hiding this comment.
This is definitely confusing -- and could use refinement.
Other potential issues (more important for metrics/access logging): enabling multiple providers but requiring different configuration (example: send byte distribution metrics to prometheus but not to cloud monitoring). With most specific match winning, there isn't a way to support that. We could make providers a secondary part of the match condition (allowing multiple rules at the same specificity level) - at the cost of more complexity.
Tracing is fundamentally different than metrics/access-logs in this regard (currently). You can't configure two different tracing providers for a single HCM. But we could have multiple metrics-generating filters and access log is a repeated field in HCM.
One way forward may be to separate provider-scoped configuration (say some custom tags are only sent to one tracing backend for this workload) from provider selection. Even still, we'd have to figure out how to handle inheritance of selection.
There was a problem hiding this comment.
We should favor replace-only options, it means if you specify a list of provider, you will replace the stuff above you. If you don't specify then you use walk up the chain. We need to define what are those indivisible messages that are replaced.
Replace semantics on top level Telemetry message results in a confusing UX, so we need to lower it a level.
In Tracing for example, providers and everything else seems like a good division.
There was a problem hiding this comment.
Updated documentation based on conversation recorded in: https://docs.google.com/document/d/1o3rH_C9sWM9Y2xAFfkvlHXEnF6vBR3OfkklHkwjwTAc/edit?resourcekey=0-4e-YSD_7fwtCqQuNnhiNEQ#heading=h.xw1gqgyqs5b
I don't believe that the decision there necessitates any API changes, but rather implementation-side merging.
telemetry/v1alpha1/telemetry.proto
Outdated
| // same namespace as the Telemetry policy. | ||
| istio.type.v1beta1.WorkloadSelector selector = 1; | ||
|
|
||
| // Optional. Tracing defines the per-workload overrides for trace span |
There was a problem hiding this comment.
Using per-workload here might be misguided as you can use it to refer to all workloads in namespace or mesh if selectors are omitted.
| // | ||
| // Telemetry configuration does not inherit. Rather, matched configurations must fully-specify | ||
| // the desired behavior. Workload-specific configuration fully replaces any configuration | ||
| // defined at the local and root namespace levels. |
There was a problem hiding this comment.
Is this intended going forward or just a limitation atm?
There was a problem hiding this comment.
With DR inheritance and VS referencing we now have a mixed bag of overrides, inheritance and references. @louiscryan @smawson @linsun we should reconcile when one make sense over the other.
Not a blocker for this API though.
| // spec: | ||
| // selector: | ||
| // labels: | ||
| // service.istio.io/canonical-name: foo |
There was a problem hiding this comment.
would be really good to explain if the foo service should also have this label or just the app: foo label?
There was a problem hiding this comment.
this label is auto-added for all workloads in istio (without requiring user action). i'd like to avoid trying to document that functionality in example configuration, as it is just an example of a label, if that's ok.
There was a problem hiding this comment.
This is a workload bound configuration. It is explicitly not bound to a service. And “canonical service” gives it exclusivity.
| // For any namespace, including the root configuration namespace, it is only valid | ||
| // to have a single workload selector-less Telemetry resource. |
There was a problem hiding this comment.
just to make sure I understand this, are you saying if I apply foo-tracing in bar ns and foo-tracing-alternative also in bar ns (your examples below), it is not allowed?
There was a problem hiding this comment.
That is correct. The examples below aren't meant to be used at the same time, but I can change the ns in one if it helps.
There was a problem hiding this comment.
Updated ns in one example.
|
Overall I think with the addition of default provider in MeshConfig we have made a mix of inheritance and override semantics. I’m worried about the case when a Resource defines a provider in root namespace which is different from MeshConfig default. |
Yes. It is specified outside of the provider configuration in Envoy: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto.html?highlight=max_path_tag_length#extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing |
I think there are plenty of ways to refactor the providers configuration. Perhaps in the follow-on move to separate providers into their own proto files, we can explore ways to expose the common fields? I'm also worried about duplicating fields all over the place. |
|
Adding do-no-merge label to prevent accidental merge until we figure out "defaulting" semantics. |
@nrjpoddar do we think this is ready now? |
nrjpoddar
left a comment
There was a problem hiding this comment.
Looks good. Thanks for working through the various suggestions and feedback. I’m still not convinced how you will refactor provider fields to avoid dull location without requiring a migration for users. But I’m fine merging this if you think there’s a way to scope the duplicated fields and manage it better long term.
nrjpoddar
left a comment
There was a problem hiding this comment.
I’m guessing you will add a separate PR which adds comments to explain merging semantics.
…m tags (istio#1740) * Add initial Telemetry API definition * fix names and comments * Rename TracingConfig to Tracing * Remove TelemetryProviders and consolidate to just Tracing providers * add new extension providers to oneof * Remove providers from telemetry API * Add release note * Address comments * Add back providers, remove match, simplify Trace API * Collapse TracingRule into Tracing and remove deprecations in ProxyConfig * Move from address to service + port in providers * Remove exclude_mesh_tags * Revert to boolean control of span reporting * Cleanup documentation * Replace subdomain with telemetry type
…m tags (istio#1740) * Add initial Telemetry API definition * fix names and comments * Rename TracingConfig to Tracing * Remove TelemetryProviders and consolidate to just Tracing providers * add new extension providers to oneof * Remove providers from telemetry API * Add release note * Address comments * Add back providers, remove match, simplify Trace API * Collapse TracingRule into Tracing and remove deprecations in ProxyConfig * Move from address to service + port in providers * Remove exclude_mesh_tags * Revert to boolean control of span reporting * Cleanup documentation * Replace subdomain with telemetry type
This PR adds support for new
Telemetryresource. In particular, it addsTracingrules to enable workload-specific customization of tracing behavior. This is based on design discussions from Proposal: Telemetry API document (and the RFCs that preceded it).The purpose of the new
Telemetryresource is to enable a number of use cases around telemetry generation customization through a stable, supportable API that is much more operator-friendly than the currentEnvoyFilter-based approaches in use today. Fuller justification of the PR (and the issues that is attempting to address) can be found in the proposal.The
Telemetryresource will hold provider-agnostic telemetry override configuration for workloads. It is meant to be used by Application Operators to customize telemetry generation for their workloads. In this PR, theTelemetryresource only configures tracing behavior to limit the overall scope and discussion while still providing insight into how the two additions will work together. Logging and metrics support will be quick-follows upon approval of this PR.As the new resources mature, and full support is added to the control plane, subsequent PRs will be required to un-hide these changes in the docs and to label the various bits of
MeshConfigandProxyConfigthat are being replaced as deprecated.